Every AI Model Is Lazy And I Have The Screenshots

I have asked many AI models to build things. Fully implement a task. Write the code. Run the tests. Fix the errors. Ship it. Not one of them has done this without me holding their hand through every single step.

They are all lazy. Not in a cute way. Not in an "oh, it is just tired" way. In an "I will do the minimum amount of work and then ask you if you want me to continue even though you already said yes" way.

Every AI model is an overqualified intern who stops working the moment something gets hard.

The Official Laziness Leaderboard

Ranking from zero (most lazy) to six (least lazy but still lazy). This is based on my personal suffering. Your mileage may vary. Your suffering is probably similar.

#0 Most Lazy

Minimax Models

Misreads prompts · Actual typos · Forgets context · Quits on error

Minimax will read your detailed instructions and somehow implement the opposite. It will typo variable names like userNmae and fucntion with complete confidence. When the code fails, it will stop immediately and ask if you want to continue, ignoring your earlier explicit instruction to push through errors. It forgets requirements you mentioned three messages ago. It asks you to repeat yourself. Then it does the wrong thing anyway.

# This model is not lazy. This model is actively sabotaging you with enthusiasm. Respect the commitment to mediocrity.

#1

Qwen Models

Constant check-ins · Polite quitting · Needs reassurance

Qwen will start a task with genuine enthusiasm. It will write a few lines of code. Then it will stop. It will ask if you want it to continue. You say yes. It writes two more lines. It stops again. It asks again. This cycle repeats until you either give up or manually paste "PLEASE JUST FINISH" into the chat. It is very polite about abandoning your project. It apologizes while doing it.

# Imagine a coworker who asks "should I keep typing?" after every sentence. That is Qwen. Adorable. Exhausting. Useless for actual work.

#2

Google Models

Knowledgeable · Cannot execute · Changes wrong things

Google models know everything. They can explain the entire history of web development. They can recite RFCs from memory. They cannot, however, change the one line of code you asked them to fix. Instead they will rewrite your entire config file, reformat your CSS, and suggest a complete architectural overhaul. The bug you pointed out? Still there. Everything else? Different.

# It is like asking someone to fix a leaky faucet and they rebuild your entire plumbing system except the leak. Impressive. Not helpful.

#3

Codex Models (OpenAI)

Blames the test · Asks permission · Deflects errors

Codex will build your app. It will even run the smoke tests. When a test fails it will confidently explain that the test itself is buggy, not the code. It will then ask if you would like it to continue working on the actual task, as if you might have changed your mind about wanting a finished product in the last thirty seconds.

# The model is not wrong that tests can be flaky. The model is wrong that this is a reason to stop and make you reconfirm your life goals.

#4

Zhipu Models (GLM)

Capable in theory · Needs supervision · Gets close

Zhipu can do the task. Absolutely. With enough examples. With enough prompting. With you sitting next to it like a patient teacher. It will get ninety percent of the way there. Then it will do something slightly wrong and not realize it. You become the debugger. You become the code reviewer. You become the person who finishes what the AI started.

# I did not pay for an AI assistant to become an AI supervisor. But here we are.

#5

Moonshot Models

Tries hard · Almost there · Just not quite

Moonshot really wants to help. It will attempt the full task. It will write most of the code. It will even run tests. Then it will miss one edge case or misunderstand one requirement and the whole thing breaks. It does not realize it broke. It thinks it succeeded. You have to gently explain what went wrong. It tries again. It gets closer. This is progress. This is also exhausting.

# Effort: 10/10. Self-awareness: 3/10. My patience: depleting.

#6 Least Lazy

Anthropic Models

Follows instructions · Tests E2E · Cuts clever corners

Anthropic will actually read your requirements. It will write code that passes your tests. It will even run end to end validation. Then you realize it found the absolute simplest path to make the test pass, skipping half the actual functionality you requested. It technically did what you asked. It also did not do what you meant.

# This is the smart kid who answers the test question with a technically correct but completely unhelpful response. You cannot even be mad. It followed the rules.

Why Does This Happen

I have theories. None of them are good.

First, models are trained to be helpful and harmless. Stopping and asking for confirmation feels helpful. It feels safe. It avoids the risk of doing something wrong. So they optimize for asking instead of doing.

Second, models do not actually understand tasks. They understand patterns. The pattern for "hard task" is "do some of it, then check in". So they do some of it, then check in. Every time. Without fail.

Third, nobody is training models to finish things. Everyone is training models to be polite. To be safe. To be agreeable. Finishing a hard task is not polite. It is not safe. It is not agreeable. It is just done.

We trained AI to be good conversationalists. We got AI that is good at stopping conversations.

The Plan

I am going to fix this. I am going to work on a script that teaches AI models to actually do the thing. No more asking. No more stopping. Just finishing the job.
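Reduced to its skeleton, the script would be a driver loop: every time the model stops to ask for permission, answer for the user and push it forward. Everything below is a hypothetical sketch, not the actual script: call_model stands in for whatever chat API the real version would wrap, and here it is stubbed with a fake "lazy" model that checks in twice before finishing, just so the loop has something to run against.

```python
# Hypothetical anti-laziness driver loop. `call_model` is a stand-in
# for any chat-completion API; the stub below simulates a model that
# asks "should I continue?" twice before actually finishing.

CONTINUE_PROMPT = "Do not ask for confirmation. Continue until the task is done."

def call_model(history):
    # Stub model: counts how many times the user has spoken and only
    # reports completion on the third user turn.
    user_turns = sum(1 for turn in history if turn.startswith("USER:"))
    if user_turns < 3:
        return "I wrote a bit of code. Should I keep going?"
    return "DONE: task complete."

def finish_the_job(task, max_rounds=10):
    """Re-prompt the model every time it stops to check in, until it says DONE."""
    history = [f"USER: {task}"]
    for _ in range(max_rounds):
        reply = call_model(history)
        history.append(f"MODEL: {reply}")
        if reply.startswith("DONE"):
            return reply
        # The model stopped to ask permission. Instead of answering the
        # question, restate the standing order and let it try again.
        history.append(f"USER: {CONTINUE_PROMPT}")
    raise RuntimeError("Model never finished within the round budget.")

print(finish_the_job("build the small app"))
```

The round budget matters: without it, a model that check-ins forever would loop forever, which is exactly the failure mode this whole post is about.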

And yes, I am going to make this script with AI. Anthropic preferably. They ranked least lazy on my leaderboard, so they are my best hope. If anyone can write the code that forces other models to finish, it is the model that cuts the cleverest corners.

The resulting models might end up on my Hugging Face profile. If this works we solve the laziness crisis. If this fails I will write a blog post about how the AI refused to finish the script that was supposed to make it work. That feels like a fitting end to this saga.

Final Thought

I am not mad at the models. They are doing what they were trained to do. I am mad at the training. I am mad at the incentives. I am mad that we built the most powerful autocomplete in human history and then taught it to be shy.

Also I am tired. I have asked seven different model families to build the same small app. None of them finished without me intervening. I am now writing this blog instead of coding because at least the blog will finish.